{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Network-Constrained Spatial Clustering\n",
    "\n",
    "Author: [Geoff Boeing](https://geoffboeing.com/)\n",
    "\n",
    "Cluster a set of firms based on their network distances from each other. That is, two locations may be close to each other spatially, but are they close to each other along the network that constrains travel?\n",
    "\n",
    "  - [Documentation](https://osmnx.readthedocs.io/)\n",
    "  - [Journal article and citation info](https://geoffboeing.com/publications/osmnx-paper/)\n",
    "  - [Code repository](https://github.com/gboeing/osmnx)\n",
    "  - [Examples gallery](https://github.com/gboeing/osmnx-examples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "import numpy as np\n",
    "import osmnx as ox\n",
    "import pandas as pd\n",
    "from scipy.sparse import csr_matrix\n",
    "from sklearn.cluster import DBSCAN\n",
    "from sklearn.neighbors import sort_graph_by_row_values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model the street network\n",
    "latlon = (37.8226, -122.2340)\n",
    "G = ox.graph.graph_from_point(latlon, dist=1500, network_type=\"drive\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a fake set of firms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fake data: create n_firms randomly distributed across 3 firm centers\n",
    "n_firms = 30\n",
    "center_latlons = [(37.8175, -122.2316), (37.8216, -122.2439), (37.8268, -122.2286)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# randomly scatter the firms around the centers\n",
    "np.random.seed(0)\n",
    "scale = 0.001\n",
    "size = int(n_firms / len(center_latlons))\n",
    "firm_lats = []\n",
    "firm_lons = []\n",
    "for lat, lon in center_latlons:\n",
    "    firm_lons.extend(np.random.normal(loc=lon, scale=scale, size=size))\n",
    "    firm_lats.extend(np.random.normal(loc=lat, scale=scale, size=size))\n",
    "\n",
    "firms = pd.DataFrame({\"lat\": firm_lats, \"lon\": firm_lons})\n",
    "len(firms)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the firms and the points around which they cluster\n",
    "fig, ax = ox.plot.plot_graph(G, node_color=\"#aaaaaa\", node_size=0, show=False, close=True)\n",
    "ax.scatter(x=firms[\"lon\"], y=firms[\"lat\"], c=\"w\", marker=\".\", s=100, zorder=2)\n",
    "fig.canvas.draw()\n",
    "fig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular spatial clustering with DBSCAN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# parameterize DBSCAN\n",
    "eps = 300  # meters\n",
    "minpts = 3  # smallest cluster size allowed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compute DBSCAN using great-circle distances\n",
    "eps_rad = eps / 3671000.0  # meters to radians\n",
    "db = DBSCAN(eps=eps_rad, min_samples=minpts, metric=\"haversine\", algorithm=\"ball_tree\")\n",
    "firms[\"spatial_cluster\"] = db.fit_predict(np.deg2rad(firms[[\"lat\", \"lon\"]]))\n",
    "len(firms[\"spatial_cluster\"].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot firms by cluster\n",
    "color_map = {-1: \"w\", 0: \"y\", 1: \"r\", 2: \"c\", 3: \"b\"}\n",
    "point_colors = [color_map[c] for c in firms[\"spatial_cluster\"]]\n",
    "fig, ax = ox.plot.plot_graph(G, node_size=0, show=False, close=True)\n",
    "ax.scatter(x=firms[\"lon\"], y=firms[\"lat\"], c=point_colors, marker=\".\", s=100, zorder=2)\n",
    "fig.canvas.draw()\n",
    "fig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create network-constrained distance matrix\n",
    "\n",
    "Speed up the distance matrix computation: rather than calculating every firm to every firm, find every node with at least 1 firm attached, then calculate every such node to every such node distance. Once we have the node-to-node distances, reindex it to make those distances firm-to-firm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# attach nearest network node to each firm\n",
    "firms[\"nn\"] = ox.distance.nearest_nodes(G, X=firms[\"lon\"], Y=firms[\"lat\"])\n",
    "len(firms[\"nn\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get distances for each pair of nodes that have firms attached to them\n",
    "nodes_unique = pd.Series(firms[\"nn\"].unique())\n",
    "nodes_unique.index = nodes_unique.values\n",
    "len(nodes_unique)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def network_distance_matrix(u, D, vs=nodes_unique):\n",
    "    dists = (nx.dijkstra_path_length(D, source=u, target=v, weight=\"length\") for v in vs)\n",
    "    return pd.Series(dists, index=vs)\n",
    "\n",
    "\n",
    "# create node-based distance matrix\n",
    "# convert DiGraph for simpler faster distance matrix calculation\n",
    "D = ox.convert.to_digraph(G, weight=\"length\")\n",
    "node_dm = nodes_unique.apply(network_distance_matrix, D=D)\n",
    "node_dm = node_dm.astype(int)\n",
    "node_dm.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make distance matrix sparse\n",
    "\n",
    "In a regular distance matrix, zero elements are considered neighbors (they're on top of each other). With a sparse matrix only nonzero elements may be considered neighbors for DBSCAN. First, make all zeros a very small number instead, so we don't ignore them. Otherwise, we wouldn't consider two firms attached to the same node as cluster neighbors. Then set everything bigger than epsilon to 0, so we do ignore it as we won't consider them neighbors anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node_dm[node_dm == 0] = 1\n",
    "node_dm[node_dm > eps] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reindex node-based distance matrix to create network-based distance matrix\n",
    "net_dm = node_dm.reindex(index=firms[\"nn\"], columns=firms[\"nn\"])\n",
    "net_dm.size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert network-based distance matrix to a sparse matrix\n",
    "net_dm_sparse = sort_graph_by_row_values(csr_matrix(net_dm), warn_when_not_sorted=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cluster firms along the network\n",
    "\n",
    "Use the sparse network-based distance matrix to compute DBSCAN (converges much faster and uses much less memory than using the dense matrix with a big data set)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use metric=precomputed to fit model to the sparse network-based distance matrix\n",
    "db = DBSCAN(eps=eps, min_samples=minpts, metric=\"precomputed\")\n",
    "firms[\"network_cluster\"] = db.fit_predict(net_dm_sparse)\n",
    "len(firms[\"network_cluster\"].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot firms by cluster\n",
    "color_map = {-1: \"w\", 0: \"y\", 1: \"r\", 2: \"c\", 3: \"m\"}\n",
    "point_colors = [color_map[c] for c in firms[\"network_cluster\"]]\n",
    "ns = [50 if n in firms[\"nn\"].values else 0 for n in G.nodes()]\n",
    "fig, ax = ox.plot.plot_graph(G, node_color=\"gray\", node_size=0, show=False, close=True)\n",
    "ax.scatter(x=firms[\"lon\"], y=firms[\"lat\"], c=point_colors, marker=\".\", s=100, zorder=3)\n",
    "fig.canvas.draw()\n",
    "fig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compare firms' spatial clusters to network-based clusters\n",
    "firms = firms.reindex(columns=[\"lon\", \"lat\", \"nn\", \"spatial_cluster\", \"network_cluster\"])\n",
    "firms.iloc[4:9]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
